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1 Abstract 


The Game of Life is a cellular automata in which cells exemplifies deterministic chaos. Running the game of life backwards in 
time is a problem which has received little attention. In this paper we investigate the application of various machine learning 
techniques reverse the game state by one step. We used both an MCMC based approach and a technique with iterated MAP 
estimates of cells. Results were validated by simply running the boards in the forward direction. The iterated MAP approach 
worked well and converged quickly, especially with sparsely populated boards. On more chaotic, densely populated boards 
the algorithm reached a good approximation. Simulated annealing was ineffective due to the high dimensionality of the search 
space. 


2 Introduction and Motivation 


The game of life (GoL) is a cellular automata created by John Conway. The board consists of cells that are each either alive 
(1) or dead (0). The next state of the board is then determined by simple rules: (a) Any live cell only remains lives if it has 
two or three live neighbors because of under/over population. (b) Any dead cell with exactly three live neighbors becomes 
a live cell through reproduction. (c) Otherwise, the cell is dead. But how does one run it in the reverse direction? Kaggle 
poses this as a question: “If machine learning (or optimization, or any method) can predict the game of life in reverse.” [5]. 


A naive approach to the problem is a brute force solution but the state space grows exponentially larger with a bigger board. 
In the case of the Kaggle competition data, the boards are 20 x 20 which means there are 29 possible boards which is 
computationally impossible to compute directly and so must turn to machine learning and statistics. 


The only directly related work online or in literature was a blog post and video by Bickford [2]. Bickford proposes a number 
of different heuristics in which the board is broken up into smaller pieces but as the author points out, it guickly runs into 
difficulties with larger patterns and took "several days to run” [2]. 


Another attractive alternative is to look towards the field of image analysis instead. Many analogies can be drawn between 
the challenge at hand and binary pixel classification. Each cell is classified into one of two states. Both boards and images can 
be represented by a 2— D grid. A popular and natural method for binary pixel classification employ Markov Random Fields 
(MRFs) [1, 7, 6]. However, the analogies end here. The primary goals of research in image analysis is in image denoising, 
image reconstruction, and smoothing which assume certain properties not present in the system of GoL, such as smoothness 
between neighboring pixels. 


There are also important observations/factors inherent to the particular instance of GoL. Namely, the boards are 20 x 20 
and have un-stitched edges. In addition, the boards have all been “warmed up” for five iterations which means certain board 
configurations are impossible. 


The methods we describe below are solutions to the problem of finding an exact predecessor state to a Game of Life board. 


3 Models and Methods 


3.1 Mathematical Definition 


Let Y represent a future board state, and X represent the initial board state. Y is a vector with Y; € {0,1}. X is a vector with 
X; € [0,1]. X; represents the probability that the cell is alive. n; represents the eight neighboring cells of X;. The function 
pk(ni) is the probability that exactly k members of the set n; are alive. Finally, let Ce be the set of all k-combinations of n;. 
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Using the GoL rules outlined in the introduction, we give the likelihood function of board Y. 


P(ylx) = Il Bern(y; ; ps (ni) + zipo (n;)) (2) 
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3.2 MAP over the entire board 


Our first approach was to find the maximum a posteriori (MAP) estimate of the state x. By imposing a prior that makes 
impossible boards less likely, we hypothesized that we could create an algorithm that could guickly converge to a valid state. 
There is no prior distribution conjugate to the likelihood, so the MAP estimate cannot be found analytically. 


Simulated annealing was employed to compute the maximum of the posterior distribution whose energy function is the log 
posterior probability. To compute the posterior probability of state x, we can write the log likelihood as shown below. 


log L(x) = $` yi log(pa(ni) + wipa(ns)) + (1 — ys) log(1 — pa(ns) — zipa(n;)) (3) 
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3.2.1 Dirichlet Prior 


For the MAP estimation to work better than MLE, the prior distribution must assign higher probability to states generated 
by the Game of Life. For each state, one can compute the number of live neighbors each cell has. The number of cells with a 
given number of neighbors seemed like a comparable statistic between the boards in the training set. Instead of using the raw 
counts, we added one to each count and normalized so that the sum is 0. The prior distribution over this sufficient statistic 
is a Dirichlet distribution. 


The parameter a for the Dirichlet distribution was found using MLE estimation using the training boards. We used gradient 
descent to maximize the likelihood of the training boards with respect to the parameter alpha [4]. 


Data: y 

Result: x 

initialize x; 

T := To; 

while not converged do 

xa 4 x +N (0,0); 

e' < log L(y|z") + log p(4'|0); 

e + log L(y|x) + log p(2|9); 

a — random number from 0 to 1; 


if a < exp (e) then 


T 
| rea’; 
end 
Ter- Ck; 
end 


3.3 MAP over each cell 


We saw that performing a random walk over the space of all possible states would not converge. Our next approach was to 
assume a starting state x and iteratively compute the MAP estimate for cell 7; given x_;. This can be computed with the 
assistance of root finding numerical methods. 


The posterior distribution we use here is a beta. The reasons for this is twofold: 1) the log of a beta distribution takes the 
same form as the likelihood function and 2) we can use the beta to add probability weight to 0 and 1 when a, 6 < 0. 


One problem with this technique is that by updating one cell at a time, the first cells to be randomly updated cells may be 
assigned to 0 or 1, which introduces a bias. We used the same method as simulated annealing by having a temperate value 
T which controls the prior parameters ageC*? and ByeC*?. During the early iterations, the probability mass of the prior is 
in the center, but as the model “cools down” after many iterations, the probability mass of the prior moves to 0 and 1. 


First we must show some properties of the function p. Assume cells i and j are neighbors and k > 0. We let nj —;i represent 
the set nj — {i}, or the set of neighbors of j excluding i. This follows from equation 1. 


Pr(nj) = Lipe—1(nj — tih) + (1 — vi)pe(n; 
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Now we can calculate the MAP for cell x;. Starting from equation 3, we compute the log posterior as a function of zi. 


- tih 


log p(wila—i,y) = $. [yy log(pa(n;) + zjp2(n;)) + (1 — yy) log(1 — pa(n;) — xjpa(n;))) 
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We find that we can compute be se posterior guickly and maximize it by Todi roots of the derivative: The derivative 


S a is of the form 


[0, 1]. Since it 


is bounded, we can use bisection or Neyon s method to ina the root, and thus the global maxima. The algorithm proceeds 


as follows. 
Data: y 
Result: x 
T := To; 
x := vector of random samples from Beta(T, T); 
while not converged do 
i — random cell; 
zi — £imap(z,y,T); 
Te TC,; 
end 


4 Results 


4.1 Simulated Annealing 


The first step was to use the training set to learn the parameter a for the prior distribution with gradient descent. These 


values correctly assign a higher probability to having fewer neighbors. 


Given the prior distribution, we ran 10,000 iterations of simulated annealing. This took about 8 hours to run and the state 
did not come close to converging. Reflecting on this approach, performing a random walk on a high dimensional space is 


unlikely to ever converge [3, p322-331]. 


4.2 MAP over cells 
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Figure 1: Left: Predicted start board, Right: Stop board 
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Figure 2: Left: Predicted start board, Right: Stop board 


The cell based algorithm was initially run against sparsely populated, simple configurations. Convergence of the start board 
cell probabilities to {0,1} was reached reached quickly. 


Observe that running the start board in the forward direction leads to the stop board, as intended. Shown in Fig 4.2, 4.2, 
left, are one of several start boards found which could generate the stop boards (Fig 4.2, 4.2 right). 


While the cell based algorithm performed well on small patterns, on the sample boards given in the Kaggle training set, the 
algorithm failed to converge on a state with discrete state assignments to each cell. The results of running on board 22 of the 
training set are shown in Figure 3. The top-left board is our resulting start board, and the top right board is our estimate 
advanced one time step. The bottom left is the actual start board and the bottom right is the actual stop board. 
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Figure 3: Top: Estimations, Bottom: Actual boards 


5 Conclusions 


We were able to successfully implement an iterated maximum a posteriori estimation approach to create optimal soft assign- 
ments to cells in the Game of Life. In the case of sparsely populated boards, MAP proved so effective that the final soft 
cell assignment values of being alive would converge onto integer values. Running the predicted initial boards in the forward 
direction, without further processing through the standard rules of GoL then yielded the original stop board. 


In the case of chaotic boards, MAP was less effective. As illustrated in Figure 3, the final values did not always converge to 0 
or 1. Instead the larger solutions would converge to a non-deterministic steady state. However, by rounding the probabilities 
and evolving the boards we produced results qualitatively similar to the desired boards. 


Simulated annealing was an attempt to use an MCMC method to find the maximum a posteriori estimate of the board without 
updating one cell at a time. This method was also attractive because it did not require a conjugate prior. Unfortunately, 
random walks in high dimensional spaces caused issues with convergence and we were not able to produce meaningful results. 
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6 Appendix 


data <- read.csv("./train.csv", header=TRUE, sep=",", stringsAsFactors=FALSE) 
size <- 20 


TO <- 10 
C.k <- .9999 
eps <- 0.00001 


for (row in c(22)) 4 
start <- as.numeric(data[row,3:402]) 
end <- as.numeric(data[row,403:802] ) 


J 


neighbors <- lapply(l:size"2, function (i) 4 
r <- (i - 1) %/% size 
c <- (i - 1) %% size 


n <- cQ) 
for (r.p in (r-1):(r+1)) { 
for (c.p in (c-1):(c+1)) 4 
if ((r.p == r && c.p == c) | 
| 


| 
r.p < 0 ll r.p >= size || 
c.p < 0 || c.p >= size) 4 
next 
} 
n <- append(n, r.p x size + c.p + 1) 
} 
} 
n 
b 


neighbor.probs <- function (x, i, without=NULL) 4 
cells <- neighbors[[i]] 
if ('is.null(without)) 4 
cells <- cells[-which(cells == without)] 
} 
sapply(cells, function (j) 4 x[j] }) 
} 


mle <- function(x, y, i, temperature) 1 
n.p <- lapply(neighbors[[i]], function (j) 4 
probs <- neighbor.probs(x, j, without=i) 
sapply(1:3, function (k) { pn(probs, k) }) 
b 


probs <- neighbor.probs(x, i) 
p <- sapply(1:3, function (k) 4 pn(probs, k) }) 


a <- mapply(function (j, p) 4 
c(x[j] x (pli] - p[2]) + p[2] - p[3], p[3] + x[j] x p[2]) 
}, neighbors[[i]], n.p, SIMPLIFY=F) 


alpha <- temperature 
beta <- temperature * 7 


if (temperature < 1) { 
log.likelihood <- function (x.i) 4 
lkh <- sum(mapply(function (j, v) 4 
pr <- max(0, min(1, c(x.i, 1) %x% v)) 
log(dbinom(y[j], 1, pr)) 
}, neighbors[[i]], a)) 
} 


maxima <- c(0, 1) 
values <- sapply(maxima, log.likelihood) 
if (max(values) == -Inf) 4 
log.posterior <- function (x.i) 4 
log.likelihood(x.i) + (alpha - 1) x log(x.i) + (beta - 1) * log(1 - x.i) 
} 
optimize(log.posterior, lower = eps, upper = 1 - eps, maximum = T)$maximum 


} 


else { 
maxima [which .max (values)] 
h 
} 
else { 
coefs <- mapply(function (j, v) { 
(v[2] - 1 + y[j]) / v[1] 
}, neighbors[[i]], a) 
coefs <- append(coefs, (p[3] - 1 + y[i]) / p[2]) 
coefs <- coefs[coefs != -Inf & coefs != Inf & !is.nan(coefs)] 
f <- function (x.i) 4 
dlkh <- sum(sapply(-coefs, function (c.i) { 
1 / (x.i - c.i) 
H) 
dlkh + (alpha - 1) / x.i + (beta - 1) / (x.i - 1) 
h 
uniroot(f, lower = eps, upper = 1 - eps)$root 
} 
} 


next.state <- function(x) { 
x <- round(x) 
sapply(1:length(x), function (i) { 
alive <- sum(sapply(neighbors[[i]], function (j) x[j])) 
if (alive == 3 | (alive == 2 & x[i])) 1 else 0 
b 


} 


find.mle <- function(y, n, x = NULL, temp = TO) 4 
if (is.null(x)) 4 
x <- rbeta(400, temp, temp * 7) 
} 
for (j in 1:n) { 
print (j) 
i <- sample(1:400, 1) 
x[i] <- mle(x, y, i, temp) 
temp <- temp * C.k 
} 
x 


} 


pn <- function(probs, n) { 
if (length(probs) < n) 4 
return(0) 
} 
sum(apply(combn(1:length(probs), n), 2, function (idx) 4 
prod(sapply(1:length(probs), function (i) 4 
dbinom(i /in/ idx, 1, probs[i]) 
H) 
H) 
} 


plot.nth <- function(grid, start, end) { 
par (mfrow=c(2,2)) 
image(matrix(grid, nrow=size, ncol=size), col=gray((32:0)/32)) 
image (matrix(next.state(grid), nrow=size, ncol-size), col=gray((32:0)/32)) 
image(matrix(start, nrow=size, ncol=size), col-gray((32:0)/32)) 
image(matrix(end, nrow=size, ncol=size), col=gray((32:0)/32)) 


